Advanced Predictive Analytics Project
Executive Summary

This project examines model-specific fuel consumption ratings and estimated carbon dioxide emissions for new light-duty vehicles for retail sale in Canada in 2022-2023. The goal of this analysis is to predict the CO2 emissions as well as the CO2 Rating and Smog Rating for these vehicles. We first perform exploratory data analysis (EDA) to understand the distribution of the variables and to look for relationships among the predictors as well as between the predictors and the response variables.

After the EDA, we perform regression analysis to predict the CO2 emissions for the vehicles using Linear Regression, Regression Trees, Random Forest and Boosted Trees. Second, we perform classification analysis to predict the CO2 Rating of these vehicles using Classification Trees and Logistic Regression. An additional Logistic Regression model was also created to predict Smog Rating, our second categorical response variable. Finally, we summarize the findings from the analysis in our conclusion.

The best regression model to predict CO2 emissions was the tuned gradient boosted tree with an R-squared of 0.999 (RMSE of 2.34 g/km), and the best classification model for CO2 Rating was the classification tree with a sensitivity of 99.4%.

We find that the variables that decrease CO2 emissions are:

  • Vehicles manufactured by Toyota
  • Vehicles with AV (continuously variable) transmission
  • Vehicles using ethanol fuel
  • Vehicles using regular gasoline fuel

The variables that increase CO2 emissions are:

  • Larger engine size
  • Vehicles manufactured by Ford and Porsche
  • Vehicle classes such as Minicompact, Pickup truck, Sport utility vehicle, SUV and Two-seater
The Problem Description

This project examines model-specific fuel consumption ratings and estimated carbon dioxide emissions for new light-duty vehicles for retail sale in Canada in 2022-2023. I will perform both regression and classification analysis based on CO2 emissions, CO2 Rating and Smog Rating.

The goal for the regression models is to predict the CO2 emissions for the new light-duty vehicles using the variables in the data set, such as engine size, number of cylinders, vehicle class and make, transmission, and fuel type. For this analysis, I will begin with an Exploratory Data Analysis (EDA) to examine the distribution of the variables in the dataset as well as the relationships between the variables. Next, I will perform regression analysis to predict the CO2 emissions. Several methods will be used in this analysis: linear regression, regression trees, Random Forest, and boosted trees.

For the classification analysis, I want to predict whether a given light-duty vehicle has a Low or High CO2/Smog Rating. Classification methods such as logistic regression and classification trees will be used for this analysis.

Finally, all the models will be summarized and compared to provide a conclusion on the model performance for predicting the variables CO2 emissions, CO2 Rating and Smog Rating, and to determine which variables helped in the prediction.

The Data

This dataset has 2756 rows and 12 variables.

Data Sources

2022 fuel consumption ratings. (2022, April 6). Kaggle. https://www.kaggle.com/datasets/rinichristy/2022-fuel-consumption-ratings

Fuel consumption ratings - Open Government Portal. (n.d.). https://open.canada.ca/data/en/dataset/98f1a129-f628-4ce4-b24d-6f16bf24dd64

Fuel consumption ratings - 2022 Fuel Consumption Ratings (2023-08-18) - Open Government Portal. (n.d.). https://open.canada.ca/data/en/dataset/98f1a129-f628-4ce4-b24d-6f16bf24dd64/resource/87fc1b5e-fafc-4d44-ac52-66656fc2a245

Variables
TO PREDICT WITH
  • make: Name of the company or brand that manufactured the vehicle
  • vehicle_class: Vehicle categories based on gross vehicle weight rating (GVWR)
  • engine_size: Volume of the vehicle’s cylinders (engine capacity), measured in liters (L)
  • cylinders: Number of cylinders the vehicle has
  • transmission: The type of transmission (A = automatic; AM = automated manual; AS = automatic with select shift; AV = continuously variable; M = manual)
  • fuel_type: Type of fuel used in the vehicle
  • fuel_consumption_city: City fuel consumption rating for gasoline mode only, measured in L/100 km
  • fuel_consumption_hwy: Highway fuel consumption rating for gasoline mode only, measured in L/100 km
  • fuel_consumption_combined: Combined fuel consumption rating for gasoline mode only, measured in L/100 km
WE WANT TO PREDICT
  • CO2_emissions: The tailpipe emissions of carbon dioxide (in grams per kilometer) for combined city and highway driving
  • CO2_rating: The tailpipe emissions of carbon dioxide rated on a scale from 1 (worst) to 10 (best), coded as Low (ratings 1-5) or High (ratings 6-10)
  • Smog_rating: The tailpipe emissions of smog-forming pollutants rated on a scale from 1 (worst) to 10 (best), coded as Low (ratings 1-5) or High (ratings 6-10)
Data Overview

From this data we can see that our variables take a variety of values depending on their types. CO2 emissions has a mean of 259 g/km with a maximum of 608 g/km. The fuel consumption rating for city driving has a mean of 12.51 L/100 km, while the rating for highway driving has a mean of 9.36 L/100 km, giving a combined average fuel consumption rating of 11.09 L/100 km. Some variables have many categories, so their summaries are provided in the tables at the bottom.

We also notice that CO2 rating and Smog rating are categorical variables with two categories: High (if the rating is 6 or above on a scale of 10) and Low (if the rating is below 6).

Low and High CO2 ratings have a noticeable difference in mean CO2 emissions. For transmission, we observe that AV transmission has comparatively lower average CO2 emissions than the other categories. The Two-seater, Pickup truck and SUV: Standard vehicle classes have comparatively higher average CO2 emissions.

View the Data Summaries

Let’s look at the range of values for each variable in the given dataset.

     make           vehicle_class       engine_size      cylinders     
 Length:2756        Length:2756        Min.   :1.000   Min.   : 3.000  
 Class :character   Class :character   1st Qu.:2.000   1st Qu.: 4.000  
 Mode  :character   Mode  :character   Median :3.000   Median : 6.000  
                                       Mean   :3.193   Mean   : 5.681  
                                       3rd Qu.:4.000   3rd Qu.: 6.000  
                                       Max.   :8.000   Max.   :16.000  
 transmission        fuel_type         fuel_consumption_city
 Length:2756        Length:2756        Min.   : 4.00        
 Class :character   Class :character   1st Qu.:10.20        
 Mode  :character   Mode  :character   Median :12.20        
                                       Mean   :12.51        
                                       3rd Qu.:14.70        
                                       Max.   :30.70        
 fuel_consumption_hwy fuel_consumption_combined CO2_emissions CO2_rating 
 Min.   : 3.90        Min.   : 4.00             Min.   : 94   High: 581  
 1st Qu.: 7.70        1st Qu.: 9.10             1st Qu.:213   Low :2175  
 Median : 9.10        Median :10.80             Median :256              
 Mean   : 9.36        Mean   :11.09             Mean   :259              
 3rd Qu.:10.70        3rd Qu.:12.90             3rd Qu.:301              
 Max.   :20.90        Max.   :26.10             Max.   :608              
 smog_rating
 High:1083  
 Low :1673  
            
            
            
            
Average CO2 Emissions by CO2 Rating
CO2_rating n mean(CO2_emissions)
High 581 177.07
Low 2175 280.88
Average CO2 Emissions by Smog Rating
smog_rating n mean(CO2_emissions)
High 1083 227.18
Low 1673 279.59
CO2 Emissions by Fuel Type
fuel_type n mean(CO2_emissions)
diesel 77 271.03
ethanol 44 292.70
premium_gasoline 1342 277.88
regular_gasoline 1293 237.52
CO2 Emissions by Transmission
transmission n mean(CO2_emissions)
A 794 286.36
AM 340 261.83
AS 1110 262.31
AV 275 174.87
M 237 245.33
CO2 Emissions by Make
make n mean(CO2_emissions)
BMW 166 273.97
Chevrolet 219 289.68
Ford 269 272.93
Others 1810 254.54
Porsche 146 283.97
Toyota 146 200.43
CO2 Emissions by Vehicle Class
vehicle_class n mean(CO2_emissions)
Compact 217 210.41
Full-size 177 255.99
Mid-size 347 230.29
Minicompact 100 274.91
Others 113 231.32
Pickup truck 379 300.89
SUV: Small 197 229.85
SUV: Standard 162 292.67
Sport utility vehicle 681 261.20
Subcompact 238 249.40
Two-seater 145 312.44
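
As a quick consistency check on the grouped tables above, the overall mean CO2 emissions can be recovered as a size-weighted average of the group means. The sketch below is in Python rather than the R used to produce the tables, and it only uses the two rows of the "Average CO2 Emissions by CO2 Rating" table; the variable names are illustrative.

```python
# Group sizes and rounded group means of CO2 emissions (g/km), copied from the
# "Average CO2 Emissions by CO2 Rating" table.
groups = {
    "High": (581, 177.07),
    "Low": (2175, 280.88),
}

total_n = sum(n for n, _ in groups.values())

# Weighted mean: each group mean counts in proportion to its group size.
overall_mean = sum(n * mean for n, mean in groups.values()) / total_n

print(total_n)                 # 2756, the number of rows in the dataset
print(round(overall_mean, 1))  # 259.0, in line with the Mean in the summary output
```

The same check works for any of the other grouped tables, since every table partitions the same 2756 vehicles.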
Response Variables relationships with predictors
  • We observe that 78.9% of the light-duty vehicles in our dataset have a Low CO2 Rating (a rating of 5 or below on a scale from 1 to 10). For Smog Rating, the data is a bit more balanced, with 60.7% of the data having a Low Smog Rating and 39.3% having a High Smog Rating.

  • Looking at the correlation matrix, we see a multicollinearity issue among many of the continuous variables, so I removed engine_size, Fuel consumption city, and Fuel consumption Highway for the regression models.

  • We see a slight positive skew in the CO2 emissions data, but most of the values are concentrated between 100 g/km and 400 g/km.

  • Among the potential predictors for CO2 emissions, the strongest relationships occur with the Fuel Consumption variables.
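
The class shares for the two rating variables can be recomputed directly from the counts printed in the data summary (CO2_rating: High 581, Low 2175; smog_rating: High 1083, Low 1673). The following is a minimal Python sketch of that arithmetic; the helper name class_shares is illustrative and not part of the original analysis.

```python
def class_shares(counts):
    """Return each class's share of the total as a percentage (1 decimal)."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}

# Counts copied from the summary output of the dataset.
co2_rating_counts = {"High": 581, "Low": 2175}
smog_rating_counts = {"High": 1083, "Low": 1673}

print(class_shares(co2_rating_counts))   # {'High': 21.1, 'Low': 78.9}
print(class_shares(smog_rating_counts))  # {'High': 39.3, 'Low': 60.7}
```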

[Plots: CO2 Rating, Smog Rating, CO2 Emissions]
Regression Summary

To predict the continuous variable CO2 emissions (CO2_emissions), I will first use a linear regression model. The results of the model are summarized below.

The full linear regression model had many unimportant predictors, so we ran a pruned model that keeps only the predictors that are important in predicting CO2 emissions. After removing the predictors that did not help with prediction, the fit statistics (R-squared and root mean squared error, RMSE) remained very similar to those of the full model.

Looking at the assumption checks and residual plots, we observed some issues with our data. The Residuals vs Fitted curve shows patterns, and the model failed most of the assumption checks for linear regression. This indicates that we can either transform the data for linear regression or predict CO2 emissions using some additional models to see if we can improve the model fit.

Effect on CO2 emissions by the Predictor Variables
Variable Direction
engine_size Increase
make_Ford Increase
make_Porsche Increase
make_Toyota Decrease
vehicle_class_Minicompact Increase
vehicle_class_Others Increase
vehicle_class_Pickup.truck Increase
vehicle_class_Sport.utility.vehicle Increase
vehicle_class_SUV..Small Increase
vehicle_class_SUV..Standard Increase
vehicle_class_Two.seater Increase
transmission_AV Decrease
fuel_type_ethanol Decrease
fuel_type_regular_gasoline Decrease
Analysis Summary

We see an R-squared of 79.4%. The residuals mostly pass the normality check, but for linearity we see that they are skewed at the very end, with more residuals above 0 than below. Examining the full model, we observe that some predictors are not significant in predicting the CO2 emissions, so we will create a pruned version of the model by removing the non-significant predictors.

model RMSE MAE RSQ
Linear Model 28.61 21.64 0.79
The Full Regression Model Coefficients
term estimate std.error statistic p.value
(Intercept) 258.80 0.70 368.98 0.00
engine_size 46.96 0.88 53.56 0.00
make_Chevrolet -1.00 1.16 -0.86 0.39
make_Ford 7.43 1.24 6.01 0.00
make_Others -2.38 1.54 -1.55 0.12
make_Porsche 2.15 1.11 1.94 0.05
make_Toyota -5.64 1.03 -5.49 0.00
vehicle_class_Full.size -0.07 0.95 -0.07 0.94
vehicle_class_Mid.size -0.54 1.06 -0.51 0.61
vehicle_class_Minicompact 1.86 0.95 1.95 0.05
vehicle_class_Others 2.59 0.87 2.98 0.00
vehicle_class_Pickup.truck 9.53 1.30 7.32 0.00
vehicle_class_Sport.utility.vehicle 11.49 1.29 8.91 0.00
vehicle_class_Subcompact -0.78 1.02 -0.77 0.44
vehicle_class_SUV..Small 6.17 0.96 6.41 0.00
vehicle_class_SUV..Standard 5.70 0.95 5.98 0.00
vehicle_class_Two.seater 5.05 0.95 5.32 0.00
transmission_AM 0.47 0.95 0.49 0.62
transmission_AS -0.68 1.01 -0.67 0.50
transmission_AV -10.08 0.89 -11.29 0.00
transmission_M 0.58 0.89 0.65 0.52
fuel_type_ethanol -6.32 0.96 -6.56 0.00
fuel_type_premium_gasoline 2.90 2.33 1.24 0.21
fuel_type_regular_gasoline -8.17 2.21 -3.70 0.00
Analysis Summary

For this analysis we will use a pruned Linear Regression Model. Although the model’s R-squared decreased slightly, the difference is less than 0.5%, and after removing the insignificant vehicle class, make, fuel type and transmission type categories the model consists only of significant predictors.

model RMSE MAE RSQ
Linear Model 28.610 21.637 0.794
Linear Final Model 28.691 21.649 0.793
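
For readers who want to see what the three columns measure, here is a minimal Python sketch of RMSE, MAE and R-squared on a handful of made-up values (the actual and predicted numbers below are hypothetical, not taken from the dataset). It assumes that RSQ is the squared correlation between actual and predicted values, which is how the tidymodels yardstick package defines its default R-squared; the section itself does not state which definition was used.

```python
import math

def regression_metrics(actual, predicted):
    """Return (RMSE, MAE, R-squared as squared Pearson correlation)."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mae = sum(abs(e) for e in errors) / n

    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    ss_a = sum((a - mean_a) ** 2 for a in actual)
    ss_p = sum((p - mean_p) ** 2 for p in predicted)
    rsq = cov * cov / (ss_a * ss_p)
    return rmse, mae, rsq

# Hypothetical CO2 emissions (g/km) and model predictions.
actual = [200, 250, 300, 350]
predicted = [210, 240, 310, 340]

rmse, mae, rsq = regression_metrics(actual, predicted)
print(round(rmse, 2), round(mae, 2), round(rsq, 3))  # 10.0 10.0 0.971
```

RMSE penalizes large errors more heavily than MAE, which is why the two can rank models differently when a few predictions are far off.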
Assumption 1 Check: Little to no multi-collinearity

The Variance Inflation Factor (VIF) allows us to check for collinearity among the X variables. A general rule is that a VIF above 5 (or, more leniently, above 10) indicates multicollinearity. We would expect an interaction term, if one were included, to be highly related to the other variables. None of the values fall above 5, so we won’t remove any more variables at this point.

The model did not pass the linearity assumption check, but the residuals are approximately normally distributed, since their distribution follows the normal curve.
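
To make the VIF rule of thumb concrete: the VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on the other predictors. With only two predictors this R² is simply their squared correlation. The Python sketch below uses that two-predictor special case on made-up engine size and cylinder values (both hypothetical); it is an illustration of the formula, not a reproduction of the VIF values computed for the model.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)

def vif_two_predictors(x1, x2):
    """VIF when the model has exactly two predictors: 1 / (1 - r^2)."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r * r)

# Hypothetical values: larger engines tend to have more cylinders.
engine_size = [1.5, 2.0, 2.5, 3.0, 3.5, 5.0]
cylinders = [4, 4, 4, 6, 6, 8]

vif = vif_two_predictors(engine_size, cylinders)
print(vif > 5)  # True: these two made-up predictors are strongly collinear
```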
Assumption 3 Check: Homoscedasticity of errors-contd

The bptest() from the lmtest package can also test for non-constant variances in the residuals. This test is often called the Breusch-Pagan test. The test has a null hypothesis of constant error variance against the alternative that the error variance changes with the level of the response (fitted values), or with a linear combination of predictors.

When we conducted the bptest, the p-value was small which means that we rejected the null and concluded that the error variance changes/is non-constant. We did not pass this assumption check which is something to keep in mind during prediction.


    studentized Breusch-Pagan test

data:  reg2_fit$fit
BP = 105.46, df = 1, p-value < 2.2e-16
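
The reported p-value can be checked by hand. Under the null hypothesis the Breusch-Pagan statistic follows a chi-square distribution with the stated degrees of freedom, and for df = 1 the upper-tail probability has the closed form erfc(sqrt(x / 2)). The Python sketch below applies that identity to the reported statistic; the function name is illustrative.

```python
import math

def chi2_sf_df1(statistic):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2.0))

bp_statistic = 105.46  # BP value reported in the test output above
p_value = chi2_sf_df1(bp_statistic)

print(p_value < 2.2e-16)  # True, matching the "p-value < 2.2e-16" in the output
```

The threshold 2.2e-16 is simply the smallest value R prints for p-values, so the output only says that the p-value is below that limit.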
Assumption 4 Check: Independence of the observations

Here we can check the independence of the observations with a Durbin-Watson test statistic. The Durbin-Watson test measures the first-order autocorrelation of the residuals. In general, values between 1.5 and 2.5 are relatively normal and we don’t worry about them. Since the statistic is very close to 2, we don’t see a violation of independence (or evidence of autocorrelation).


    Durbin-Watson test

data:  reg2_fit$fit
DW = 1.9565, p-value = 0.1693
alternative hypothesis: true autocorrelation is greater than 0
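
The Durbin-Watson statistic itself is a simple ratio: the sum of squared differences between consecutive residuals divided by the sum of squared residuals. A minimal Python sketch with toy residuals (not the model's residuals) shows why values near 2 indicate no first-order autocorrelation, since DW is approximately 2(1 - r) for lag-1 autocorrelation r.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences over the sum of squared residuals."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e * e for e in residuals)
    return numerator / denominator

# Toy residuals: perfectly alternating signs give strong negative autocorrelation.
print(durbin_watson([1, -1, 1, -1]))  # 3.0
# Toy residuals: identical values give strong positive autocorrelation.
print(durbin_watson([1, 1, 1, 1]))    # 0.0

# Implied lag-1 autocorrelation for the reported DW = 1.9565.
implied_r = 1 - 1.9565 / 2
print(round(implied_r, 4))            # 0.0217
```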
The Final Regression Model Coefficients
term estimate std.error statistic p.value
(Intercept) 258.79 0.70 369.55 0.00
engine_size 47.20 0.82 57.69 0.00
make_Ford 8.58 0.80 10.69 0.00
make_Porsche 3.51 0.82 4.27 0.00
make_Toyota -4.69 0.74 -6.31 0.00
vehicle_class_Minicompact 2.17 0.83 2.61 0.01
vehicle_class_Others 2.76 0.74 3.72 0.00
vehicle_class_Pickup.truck 9.43 0.88 10.69 0.00
vehicle_class_Sport.utility.vehicle 11.61 0.81 14.41 0.00
vehicle_class_SUV..Small 6.33 0.77 8.27 0.00
vehicle_class_SUV..Standard 5.71 0.75 7.60 0.00
vehicle_class_Two.seater 5.37 0.76 7.05 0.00
transmission_AV -10.09 0.76 -13.20 0.00
fuel_type_ethanol -6.94 0.83 -8.35 0.00
fuel_type_regular_gasoline -10.67 0.88 -12.12 0.00
Compare actual (CO2_emissions) vs predicted (y_hat) for pruned regression model
Regression Tree Summary

After examining the regression tree, the tuned Random Forest (RF) and the tuned boosted tree, we can see that the most important variables for all three models are fuel_consumption_combined (combined city and highway fuel consumption) and engine_size. The RF and boosted tree models had a better fit than the regression tree. The next most important variables for the RF and boosted tree models are cylinders and transmission_AV. We can see that:

  • If combined fuel consumption for city and highway is higher, the CO2_emissions for the vehicle tend to be higher.
  • The Boosted tree is suggested since it performs the best out of all the regression trees.
Analysis Summary

We will predict the CO2 emissions with all the variables.

model RMSE MAE RSQ
Linear Model 28.61 21.64 0.79
Linear Final Model 28.69 21.65 0.79
Reg Tree Model 16.57 11.65 0.93
View the Regression Tree and Variable Importance

We see that the regression tree has 9 leaf nodes.
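
Each internal node of a regression tree is a threshold on one predictor, chosen to make the response as homogeneous as possible on either side. The Python sketch below illustrates that idea for a single predictor by searching for the threshold that minimizes the total sum of squared errors (SSE) of the two resulting groups. The data are made up, and the code is a simplified illustration of the splitting criterion, not the rpart implementation used for the tree above.

```python
def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(x, y):
    """Return (threshold, total SSE) of the best single split on x."""
    pairs = sorted(zip(x, y))
    best_threshold, best_total = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no threshold can separate identical x values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [target for _, target in pairs[:i]]
        right = [target for _, target in pairs[i:]]
        total = sse(left) + sse(right)
        if total < best_total:
            best_threshold, best_total = threshold, total
    return best_threshold, best_total

# Hypothetical combined fuel consumption (L/100 km) and CO2 emissions (g/km).
fuel = [6.0, 7.0, 8.0, 12.0, 13.0, 14.0]
co2 = [140, 165, 190, 280, 305, 330]

print(best_split(fuel, co2))  # (10.0, 2500.0)
```

A full tree repeats this search over every predictor and then recursively inside each resulting group until a stopping rule, such as the cost complexity penalty, is met.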

Compare actual (CO2_emissions) vs predicted (y_hat)
Analysis Summary

We will predict the CO2 emissions of vehicles using all the variables in the Random Forest Model.

  • Mtry of 15 and 1500 trees were selected as parameters for the model.
model RMSE MAE RSQ
Linear Model 28.610 21.637 0.794
Linear Final Model 28.691 21.649 0.793
Reg Tree Model 16.566 11.652 0.931
Tuned RF Tree Model 3.772 1.538 0.996
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Formula
Model: rand_forest()

── Preprocessor ────────────────────────────────────────────────────────────────
CO2_emissions ~ .

── Model ───────────────────────────────────────────────────────────────────────
Random Forest Model Specification (regression)

Main Arguments:
  mtry = 15
  trees = 1500

Engine-Specific Arguments:
  importance = impurity

Computational engine: ranger 
View the Variable Importance

Fuel consumption combined and engine size are the most important predictors in our RF model.

Compare actual (CO2_emissions) vs predicted (y_hat)
Analysis Summary

Let’s look at a boosted tree to see if our metrics/results improve.

  • The parameters chosen for the boosted model are mtry = 12, trees = 500, min_n = 5, tree_depth = 9, learn_rate = 0.0328 and loss_reduction = 0.240.
model RMSE MAE RSQ
Linear Model 28.610 21.637 0.794
Linear Final Model 28.691 21.649 0.793
Reg Tree Model 16.566 11.652 0.931
Tuned RF Tree Model 3.772 1.538 0.996
Tuned Gradient Boosted Tree Model 2.344 1.371 0.999
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Formula
Model: boost_tree()

── Preprocessor ────────────────────────────────────────────────────────────────
CO2_emissions ~ .

── Model ───────────────────────────────────────────────────────────────────────
Boosted Tree Model Specification (regression)

Main Arguments:
  mtry = 12
  trees = 500
  min_n = 5
  tree_depth = 9
  learn_rate = 0.0328447200747419
  loss_reduction = 0.239886500258523

Computational engine: xgboost 
View the Variable Importance

Fuel consumption combined and engine size are the most important predictors in the model.

Compare actual (CO2_emissions) vs predicted (y_hat) tuned tree
Classification Models

We are using the classification models to predict the High/Low CO2 Rating. For the logistic regression, we also predicted the High/Low Smog Rating. These were coded as categorical variables in the earlier steps, where High means a rating of 6 and above and Low means a rating below 6.

For this analysis, we will perform a logistic regression and then the classification tree.

  • We observed that the sensitivity of the original classification tree for CO2_rating was 99.4%. This means that out of all the vehicles that actually had a High rating, 99.4% were correctly predicted as High by the model.

  • We saw an accuracy of 98.9% for the best model. This means that 98.9% of the observations were correctly predicted as their actual rating (both High and Low).

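  • Using the best cutoffs for the models, the sensitivity of the classification tree stayed at 99.4%, while the sensitivity of the logistic model increased from 97.1% to 99.4%.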

  • The model I would choose for the classification is the classification tree because it is easy to explain. The parameters for the classification tree were a cost complexity of 0.1 and a tree depth of 4.

Classification Tree Summary

We will use all the variables except CO2_emissions, because this is what CO2_rating is created from. For this model we set the cost complexity to 0.1.

          Truth
Prediction High Low
      High  174   8
      Low     1 645
model Accuracy Sensitivity Specificity Avg_Sens_Spec Precision
Classification Tree CO2 rating Model 0.989 0.994 0.988 0.991 0.956
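
The metrics in this table follow directly from the confusion matrix above. The Python sketch below recomputes them, assuming that High is treated as the positive class (an assumption that is consistent with the reported values); the function and argument names are illustrative.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute the table's metrics from confusion-matrix counts.

    tp: predicted High, truly High    fp: predicted High, truly Low
    fn: predicted Low, truly High     tn: predicted Low, truly Low
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "avg_sens_spec": (sensitivity + specificity) / 2,
        "precision": tp / (tp + fp),
    }

# Counts from the classification tree confusion matrix for CO2 rating.
metrics = classification_metrics(tp=174, fp=8, fn=1, tn=645)
print({name: round(value, 3) for name, value in metrics.items()})
# {'accuracy': 0.989, 'sensitivity': 0.994, 'specificity': 0.988,
#  'avg_sens_spec': 0.991, 'precision': 0.956}
```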
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Formula
Model: decision_tree()

── Preprocessor ────────────────────────────────────────────────────────────────
CO2_rating ~ .

── Model ───────────────────────────────────────────────────────────────────────
Decision Tree Model Specification (classification)

Main Arguments:
  cost_complexity = 0.1
  tree_depth = 4

Computational engine: rpart 
View the Classification Tree and Variable Importance

We can see we have 2 leaf nodes. The higher the vip value, the more important the predictor is for classification.

View the ROC Curve
Best Threshold
Best_Cutoff Sensitivity Specificity AUC_for_Model
0.96 0.99 0.99 0.99
          Truth
Prediction High Low
      High  174   8
      Low     1 645
model Accuracy Sensitivity Specificity Avg_Sens_Spec Precision
Classification Tree CO2 rating Model 0.989 0.994 0.988 0.991 0.956
Classification Tree Model Best Cutoff 0.96 0.989 0.994 0.988 0.991 0.956
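
The section does not say how the best cutoff was selected. One common rule, sketched below in Python, is to pick the probability threshold that maximizes sensitivity plus specificity (equivalent to Youden's J statistic). The probabilities, labels and candidate cutoffs are made up for illustration, and the rule itself is an assumption rather than the method confirmed for this dashboard.

```python
def sens_spec(probs, labels, cutoff):
    """Sensitivity and specificity when predicting positive if prob >= cutoff."""
    tp = sum(1 for p, y in zip(probs, labels) if p >= cutoff and y == 1)
    fn = sum(1 for p, y in zip(probs, labels) if p < cutoff and y == 1)
    tn = sum(1 for p, y in zip(probs, labels) if p < cutoff and y == 0)
    fp = sum(1 for p, y in zip(probs, labels) if p >= cutoff and y == 0)
    return tp / (tp + fn), tn / (tn + fp)

def best_cutoff(probs, labels, candidates):
    """Candidate cutoff with the largest sensitivity + specificity."""
    return max(candidates, key=lambda c: sum(sens_spec(probs, labels, c)))

# Hypothetical predicted probabilities of the positive class and true labels
# (1 = positive, 0 = negative).
probs = [0.95, 0.80, 0.60, 0.40, 0.30, 0.10]
labels = [1, 1, 1, 0, 1, 0]

print(best_cutoff(probs, labels, [0.2, 0.5, 0.8]))  # 0.5
```

Lowering the cutoff trades specificity for sensitivity, so the chosen threshold depends on which kind of error matters more for the application.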
Logistic Summary- CO2 Rating

For our final model, we will use logistic regression to explore two response variables: CO2 Rating and Smog Rating.

We observed that combined fuel consumption (city + highway) and Vehicle Class SUV: Small are the most important predictors in the model. The coefficients of the full logistic regression equation are given below.

term estimate std.error statistic p.value
(Intercept) 38.85 255.66 0.15 0.88
engine_size -1.14 1.49 -0.76 0.45
fuel_consumption_combined 51.08 8.75 5.84 0.00
make_Chevrolet -0.84 0.54 -1.57 0.12
make_Ford 0.34 1.23 0.28 0.78
make_Others -0.19 0.45 -0.42 0.67
make_Porsche -0.58 652.62 0.00 1.00
make_Toyota 0.23 0.83 0.27 0.79
vehicle_class_Full.size -0.37 0.38 -0.97 0.33
vehicle_class_Mid.size -0.40 0.37 -1.08 0.28
vehicle_class_Minicompact -0.08 0.37 -0.21 0.83
vehicle_class_Others -0.30 0.29 -1.02 0.31
vehicle_class_Pickup.truck -0.28 506.81 0.00 1.00
vehicle_class_Sport.utility.vehicle -0.33 0.53 -0.62 0.54
vehicle_class_Subcompact -0.09 0.34 -0.27 0.79
vehicle_class_SUV..Small -0.68 0.34 -1.97 0.05
vehicle_class_SUV..Standard 0.05 119.53 0.00 1.00
vehicle_class_Two.seater -0.18 0.95 -0.19 0.85
transmission_AM -0.51 0.37 -1.38 0.17
transmission_AS -0.95 0.53 -1.80 0.07
transmission_AV -0.41 0.40 -1.01 0.31
transmission_M -0.32 0.44 -0.73 0.47
fuel_type_ethanol -10.83 573.16 -0.02 0.98
fuel_type_premium_gasoline -10.35 2285.60 0.00 1.00
fuel_type_regular_gasoline -9.88 2282.03 0.00 1.00
Pruned Logistic Regression Equation
term estimate std.error statistic p.value
(Intercept) 19.48 2.02 9.62 0.00
fuel_consumption_combined 24.74 2.60 9.52 0.00
transmission_AS -0.21 0.18 -1.15 0.25
vehicle_class_SUV..Small -0.18 0.14 -1.29 0.20
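
To show how these coefficients turn into a predicted probability, the Python sketch below plugs them into the logistic function. Two assumptions are made that the section does not state: the predictors were centered and scaled before fitting (suggested by the linear model's intercept being close to the mean CO2 emissions), and the modeled outcome is the Low rating (R's glm models the second factor level). The function names and input values are illustrative only.

```python
import math

def sigmoid(z):
    """Logistic function mapping log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# Coefficients copied from the pruned logistic regression table above.
COEFS = {
    "(Intercept)": 19.48,
    "fuel_consumption_combined": 24.74,
    "transmission_AS": -0.21,
    "vehicle_class_SUV..Small": -0.18,
}

def prob_low_rating(features):
    """Assumed P(CO2 rating = Low) for standardized predictor values."""
    log_odds = COEFS["(Intercept)"] + sum(
        COEFS[name] * value for name, value in features.items()
    )
    return sigmoid(log_odds)

# A vehicle at the average of every (standardized) predictor.
average_vehicle = {
    "fuel_consumption_combined": 0.0,
    "transmission_AS": 0.0,
    "vehicle_class_SUV..Small": 0.0,
}
# The same vehicle with fuel consumption one standard deviation below average.
efficient_vehicle = dict(average_vehicle, fuel_consumption_combined=-1.0)

print(prob_low_rating(average_vehicle) > 0.999)   # True
print(prob_low_rating(efficient_vehicle) < 0.01)  # True
```

Under these assumptions the prediction is driven almost entirely by combined fuel consumption, which matches the large coefficient and the non-significant p-values of the other two terms.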
Metrics, and VIP Plot
model Accuracy Sensitivity Specificity Avg_Sens_Spec Precision
Classification Tree CO2 rating Model 0.99 0.99 0.99 0.99 0.96
Classification Tree Model Best Cutoff 0.96 0.99 0.99 0.99 0.99 0.96
Pruned CO2 Rating Logistic Model 0.99 0.97 0.99 0.98 0.96
          Truth
Prediction High Low
      High  170   7
      Low     5 646
View the ROC Curve
Best Threshold
Best_Cutoff Sensitivity Specificity AUC_for_Model
0.47 0.99 0.99 1
          Truth
Prediction High Low
      High  174   9
      Low     1 644
model Accuracy Sensitivity Specificity Avg_Sens_Spec Precision
Classification Tree CO2 rating Model 0.9891 0.9943 0.9877 0.9910 0.9560
Classification Tree Model Best Cutoff 0.96 0.9891 0.9943 0.9877 0.9910 0.9560
Pruned CO2 Rating Logistic Model 0.9855 0.9714 0.9893 0.9804 0.9605
Logistic Model Best Cutoff 0.47 0.9879 0.9943 0.9862 0.9903 0.9508
Logistic Summary-Smog Rating

For our final model, we will use logistic regression to also explore the Smog rating variable.

In this model, we can see that combined fuel consumption and transmission AM are the most important predictors. The coefficients of the equation are given below.

term estimate std.error statistic p.value
(Intercept) 2.06 23.68 0.09 0.93
engine_size 0.45 0.13 3.48 0.00
fuel_consumption_combined 1.71 0.17 10.11 0.00
make_Chevrolet -0.50 0.11 -4.65 0.00
make_Ford -0.38 0.11 -3.37 0.00
make_Others -0.34 0.15 -2.32 0.02
make_Porsche 3.24 80.59 0.04 0.97
make_Toyota -0.15 0.09 -1.72 0.09
vehicle_class_Full.size 0.03 0.08 0.31 0.76
vehicle_class_Mid.size -0.06 0.08 -0.76 0.45
vehicle_class_Minicompact -0.05 0.09 -0.54 0.59
vehicle_class_Others -0.10 0.06 -1.59 0.11
vehicle_class_Pickup.truck -0.89 0.11 -8.13 0.00
vehicle_class_Sport.utility.vehicle -0.46 0.11 -4.37 0.00
vehicle_class_Subcompact -0.03 0.08 -0.41 0.68
vehicle_class_SUV..Small -0.15 0.07 -2.07 0.04
vehicle_class_SUV..Standard -0.60 0.08 -7.39 0.00
vehicle_class_Two.seater 0.02 0.10 0.20 0.84
transmission_AM 0.82 0.09 9.06 0.00
transmission_AS 0.45 0.08 5.44 0.00
transmission_AV 0.30 0.08 3.92 0.00
transmission_M 0.26 0.07 3.53 0.00
fuel_type_ethanol -2.58 63.08 -0.04 0.97
fuel_type_premium_gasoline -10.12 251.56 -0.04 0.97
fuel_type_regular_gasoline -9.90 251.16 -0.04 0.97
Pruned Logistic Regression Equation
term estimate std.error statistic p.value
(Intercept) 0.76 0.06 12.09 0.00
engine_size 0.49 0.11 4.32 0.00
fuel_consumption_combined 1.07 0.13 8.54 0.00
make_Chevrolet -0.48 0.09 -5.23 0.00
make_Ford -0.42 0.09 -4.58 0.00
make_Others -0.49 0.13 -3.91 0.00
make_Toyota -0.22 0.08 -2.90 0.00
vehicle_class_Pickup.truck -0.50 0.07 -7.18 0.00
vehicle_class_Sport.utility.vehicle -0.20 0.06 -3.33 0.00
vehicle_class_SUV..Standard -0.39 0.06 -6.70 0.00
transmission_AM 0.55 0.07 7.49 0.00
transmission_AS 0.16 0.07 2.47 0.01
transmission_M 0.15 0.06 2.40 0.02
Metrics and VI Plot
model Accuracy Sensitivity Specificity Avg_Sens_Spec Precision
Pruned Smog Rating Logistic Model 0.74 0.64 0.8 0.72 0.67
          Truth
Prediction High Low
      High  208 102
      Low   117 400
View the ROC Curve
Best Threshold
Best_Cutoff Sensitivity Specificity AUC_for_Model
0.3 0.89 0.64 0.81
          Truth
Prediction High Low
      High  288 182
      Low    37 320
model Accuracy Sensitivity Specificity Avg_Sens_Spec Precision
Pruned Smog Rating Logistic Model 0.74 0.64 0.80 0.72 0.67
Logistic Model Smog Rating Best Cutoff 0.3 0.74 0.89 0.64 0.76 0.61
Summary

In conclusion, we can see that our predictors do help to predict CO2 emissions, either as the High/Low CO2 Rating (with High defined as a rating of 6 or above) or as the actual CO2 emissions.

Combining the results of both types of predictor models and only reporting where agreement was found, we can see that as these variables increase they:

Decrease CO2 emissions:

  • Vehicles manufactured by Toyota
  • Vehicles with AV (continuously variable) transmission
  • Vehicles using ethanol fuel
  • Vehicles using regular gasoline fuel

Increase CO2 emissions:

  • Engine size of vehicles
  • Vehicles manufactured by Ford and Porsche
  • Vehicle classes such as Minicompact, Pickup truck, Sport utility vehicle, SUV and Two-seater
Predicting Continuous CO2 emissions

In addition, if we compare the models that we examined for predicting continuous CO2 emissions, we see that the Tuned Random Forest and the Tuned Gradient Boosted Tree performed much better than the Linear Regression and Regression Tree models.

  • Final Linear Regression MAE: 21.65
  • Regression Tree MAE: 11.65
  • Tuned RF Tree MAE: 1.54
  • Tuned Gradient Boosted Tree MAE: 1.37
model RMSE MAE RSQ
Linear Model 28.61 21.64 0.79
Linear Final Model 28.69 21.65 0.79
Reg Tree Model 16.57 11.65 0.93
Tuned RF Tree Model 3.77 1.54 1.00
Tuned Gradient Boosted Tree Model 2.34 1.37 1.00
Compare actual (CO2_emissions) vs predicted (y_hat) tuned tree
Predicting CO2_rating and Smog Rating Value

Predicting Categorical CO2 Rating

Comparing the models we examined for predicting the categorical response CO2 rating, we observed that they perform similarly, but the classification tree has slightly higher precision, accuracy and specificity than the best logistic model, with the same sensitivity.

  • Classification Tree (cutoff 0.96) Accuracy .989 Sensitivity .994 Specificity 0.988 Precision .956
  • Logistic Regression (cutoff 0.47) Accuracy .988 Sensitivity .994 Specificity 0.986 Precision .951
model Accuracy Sensitivity Specificity Avg_Sens_Spec Precision
Classification Tree CO2 rating Model 0.99 0.99 0.99 0.99 0.96
Classification Tree Model Best Cutoff 0.96 0.99 0.99 0.99 0.99 0.96
Pruned CO2 Rating Logistic Model 0.99 0.97 0.99 0.98 0.96
Logistic Model Best Cutoff 0.47 0.99 0.99 0.99 0.99 0.95

Predicting Categorical Smog Rating

Looking at the logistic model for Smog rating, we observed that with the best cutoff the model has a higher sensitivity and a higher average of sensitivity and specificity, while accuracy is unchanged and specificity and precision decreased.

  • Logistic Regression (cutoff 0.3) Accuracy .74 Sensitivity .89 Specificity 0.64 Precision .61
model Accuracy Sensitivity Specificity Avg_Sens_Spec Precision
Pruned Smog Rating Logistic Model 0.74 0.64 0.80 0.72 0.67
Logistic Model Smog Rating Best Cutoff 0.3 0.74 0.89 0.64 0.76 0.61
ROC Curves
Reflection

Q. What did you work hardest on or are you most proud of in your project?

  • Formatting the dashboard was what I worked hardest on and am most proud of in my project. I think it is very important to learn to communicate effectively, and formatting this project as a dashboard made us all think about ways to choose and organize information to communicate the results effectively.

Q. What would you do if you had another week to work on the project?

  • If I had one more week to work on the project, I would like to try more classification models for the Smog rating variable. Since the CO2 rating response variable barely had any significant predictors, I tried to see if the results for a different categorical response were better. I think it would have been interesting to look at classification trees for Smog Rating, or even RF and Boosted Tree models.